.TH SANLOCK 8 2015-01-23

.SH NAME
sanlock \- shared storage lock manager

.SH SYNOPSIS
.B sanlock
[COMMAND] [ACTION] ...

.SH DESCRIPTION

sanlock is a lock manager built on shared storage.  Hosts with access to
the storage can perform locking.  An application running on the hosts is
given a small amount of space on the shared block device or file, and uses
sanlock for its own application-specific synchronization.  Internally, the
sanlock daemon manages locks using two disk-based lease algorithms: delta
leases and paxos leases.

.IP \[bu] 2
.I delta leases
are slow to acquire and demand regular i/o to shared
storage.  sanlock only uses them internally to hold a lease on its
"host_id" (an integer host identifier from 1-2000).  They prevent two
hosts from using the same host identifier.  The delta lease renewals also
indicate if a host is alive.  ("Light-Weight Leases for Storage-Centric
Coordination", Chockler and Malkhi.)

.IP \[bu]
.I paxos leases
are fast to acquire and sanlock makes them available to
applications as general purpose resource leases.  The disk paxos
algorithm uses host_id's internally to represent different hosts, and
the owner of a paxos lease.  delta leases provide unique host_id's for
implementing paxos leases, and delta lease renewals serve as a proxy for
paxos lease renewal.  ("Disk Paxos", Eli Gafni and Leslie Lamport.)

.P

Externally, the sanlock daemon exposes a locking interface through
libsanlock in terms of "lockspaces" and "resources".  A lockspace is a
locking context that an application creates for itself on shared storage.
When the application on each host is started, it "joins" the lockspace.
It can then create "resources" on the shared storage.  Each resource
represents an application-specific entity.  The application can acquire
and release leases on resources.

To use sanlock from an application:

.IP \[bu] 2
Allocate shared storage for an application,
e.g. a shared LUN or LV from a SAN, or files from NFS.

.IP \[bu]
Provide the storage to the application.

.IP \[bu]
The application uses this storage with libsanlock to create a lockspace
and resources for itself.

.IP \[bu]
The application joins the lockspace when it starts.

.IP \[bu]
The application acquires and releases leases on resources.

.P

How lockspaces and resources translate to delta leases and paxos leases
within sanlock:

.I Lockspaces

.IP \[bu] 2
A lockspace is based on delta leases held by each host using the lockspace.

.IP \[bu]
A lockspace is a series of 2000 delta leases on disk, and requires 1MB of storage.
(See Storage below for size variations.)

.IP \[bu]
A lockspace can support up to 2000 concurrent hosts using it, each
using a different delta lease.

.IP \[bu]
Applications can i) create, ii) join and iii) leave a lockspace, which
corresponds to i) initializing the set of delta leases on disk,
ii) acquiring one of the delta leases and iii) releasing the delta lease.

.IP \[bu]
When a lockspace is created, a unique lockspace name and disk location
is provided by the application.

.IP \[bu]
When a lockspace is created/initialized, sanlock formats the sequence of
2000 on-disk delta lease structures on the file or disk,
e.g. /mnt/leasefile (NFS) or /dev/vg/lv (SAN).

.IP \[bu]
The 2000 individual delta leases in a lockspace are identified by
number: 1,2,3,...,2000.

.IP \[bu]
Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset 512,
delta lease 2000 is offset 1023488.  (See Storage below for size variations.)

.IP \[bu]
When an application joins a lockspace, it must specify the lockspace
name, the lockspace location on shared disk/file, and the local host's
host_id.  sanlock then acquires the delta lease corresponding to the
host_id, e.g. joining the lockspace with host_id 1 acquires delta lease 1.

.IP \[bu]
The terms delta lease, lockspace lease, and host_id lease are used
interchangably.

.IP \[bu]
sanlock acquires a delta lease by writing the host's unique name to the
delta lease disk sector, reading it back after a delay, and verifying
it is the same.

.IP \[bu]
If a unique host name is not specified, sanlock generates a uuid to use
as the host's name.  The delta lease algorithm depends on hosts using
unique names.

.IP \[bu]
The application on each host should be configured with a unique host_id,
where the host_id is an integer 1-2000.

.IP \[bu]
If hosts are misconfigured and have the same host_id, the delta lease
algorithm is designed to detect this conflict, and only one host will
be able to acquire the delta lease for that host_id.

.IP \[bu]
A delta lease ensures that a lockspace host_id is being used by a single
host with the unique name specified in the delta lease.

.IP \[bu]
Resolving delta lease conflicts is slow, because the algorithm is based
on waiting and watching for some time for other hosts to write to the
same delta lease sector.  If multiple hosts try to use the same delta
lease, the delay is increased substantially.  So, it is best to configure
applications to use unique host_id's that will not conflict.

.IP \[bu]
After sanlock acquires a delta lease, the lease must be renewed until
the application leaves the lockspace (which corresponds to releasing the
delta lease on the host_id.)

.IP \[bu]
sanlock renews delta leases every 20 seconds (by default) by writing a
new timestamp into the delta lease sector.

.IP \[bu]
When a host acquires a delta lease in a lockspace, it can be referred to
as "joining" the lockspace.  Once it has joined the lockspace, it can
use resources associated with the lockspace.

.P

.I Resources

.IP \[bu] 2
A lockspace is a context for resources that can be locked and unlocked
by an application.

.IP \[bu]
sanlock uses paxos leases to implement leases on resources.  The terms
paxos lease and resource lease are used interchangably.

.IP \[bu]
A paxos lease exists on shared storage and requires 1MB of space.
It contains a unique resource name and the name of the lockspace.

.IP \[bu]
An application assigns its own meaning to a sanlock resource and the
leases on it.  A sanlock resource could represent some shared object
like a file, or some unique role among the hosts.

.IP \[bu]
Resource leases are associated with a specific lockspace and can only be
used by hosts that have joined that lockspace (they are holding a
delta lease on a host_id in that lockspace.)

.IP \[bu]
An application must keep track of the disk locations of its lockspaces
and resources.  sanlock does not maintain any persistent index or
directory of lockspaces or resources that have been created by
applications, so applications need to remember where they have placed
their own leases (which files or disks and offsets).

.IP \[bu]
sanlock does not renew paxos leases directly (although it could).
Instead, the renewal of a host's delta lease represents the renewal of
all that host's paxos leases in the associated lockspace. In effect,
many paxos lease renewals are factored out into one delta lease renewal.
This reduces i/o when many paxos leases are used.

.IP \[bu]
The disk paxos algorithm allows multiple hosts to all attempt to
acquire the same paxos lease at once, and will produce a single
winner/owner of the resource lease.  (Shared resource leases are also
possible in addition to the default exclusive leases.)

.IP \[bu]
The disk paxos algorithm involves a specific sequence of reading and
writing the sectors of the paxos lease disk area.  Each host has a
dedicated 512 byte sector in the paxos lease disk area where it writes its
own "ballot", and each host reads the entire disk area to see the ballots
of other hosts.  The first sector of the disk area is the "leader record"
that holds the result of the last paxos ballot.  The winner of the paxos
ballot writes the result of the ballot to the leader record (the winner of
the ballot may have selected another contending host as the owner of the
paxos lease.)

.IP \[bu]
After a paxos lease is acquired, no further i/o is done in the paxos
lease disk area.

.IP \[bu]
Releasing the paxos lease involves writing a single sector to clear the
current owner in the leader record.

.IP \[bu]
If a host holding a paxos lease fails, the disk area of the paxos lease
still indicates that the paxos lease is owned by the failed host.  If
another host attempts to acquire the paxos lease, and finds the lease is
held by another host_id, it will check the delta lease of that host_id.
If the delta lease of the host_id is being renewed, then the paxos lease
is owned and cannot be acquired.  If the delta lease of the owner's host_id
has expired, then the paxos lease is expired and can be taken (by
going through the paxos lease algorithm.)

.IP \[bu]
The "interaction" or "awareness" between hosts of each other is limited
to the case where they attempt to acquire the same paxos lease, and need
to check if the referenced delta lease has expired or not.

.IP \[bu]
When hosts do not attempt to lock the same resources concurrently, there
is no host interaction or awareness.  The state or actions of one host
have no effect on others.

.IP \[bu]
To speed up checking delta lease expiration (in the case of a paxos
lease conflict), sanlock keeps track of past renewals of other delta
leases in the lockspace.

.P

.I Resource Index

The resource index (rindex) is an optional sanlock feature that
applications can use to keep track of resource lease offsets.  Without the
rindex, an application must keep track of where its resource leases exist
on disk and find available locations when creating new leases.

The sanlock rindex uses two align-size areas on disk following the
lockspace.  The first area holds rindex entries; each entry records a
resource lease name and location.  The second area holds a private paxos
lease, used by sanlock internally to protect rindex updates.

The application creates the rindex on disk with the "format" function.
Format is a disk-only operation and does not interact with the live
lockspace, so it can be called without first calling add_lockspace.  The
application needs to follow the convention of writing the lockspace at the
start of the device (offset 0) and formatting the rindex immediately
following the lockspace area.  When formatting, the application must set
flags for sector size and align size to match those for the lockspace.

To use the rindex, the application:

.IP \[bu] 2
Uses the "create" function to create a new resource lease on disk.  This
takes the place of the write_resource function.  The create function
requires the location of the rindex and the name of the new resource
lease.  sanlock finds a free lease area, writes the new resource lease at
that location, updates the rindex with the name:offset, and returns the
offset to the caller.  The caller uses this offset when acquiring the
resource lease.

.IP \[bu]
Uses the "delete" function to remove a resource disk on disk (also
corresponding to the write_resource function.)  sanlock clears the
resource lease and the rindex entry for it.  A subsequent call to create
may use this same disk location for a different resource lease.

.IP \[bu]
Uses the "lookup" function to discover the offset of a resource lease
given the resource lease name.  The caller would typically call this prior
to acquiring the resource lease.

.IP \[bu]
Uses the "rebuild" function to recreate the rindex if it is damaged or
becomes inconsistent.  This function scans the disk for resource leases
and creates new rindex entries to match the leases it finds.

.IP \[bu]
The "update" function manipulates rindex entries directly and should not
normally be used by the application.  In normal usage, the create and
delete functions manipulate rindex entries.  Update is mainly useful for
testing or repairs.

.P

.I Expiration

.IP \[bu] 2
If a host fails to renew its delta lease, e.g. it looses access to the
storage, its delta lease will eventually expire and another host will be
able to take over any resource leases held by the host.  sanlock must
ensure that the application on two different hosts is not holding and
using the same lease concurrently.

.IP \[bu]
When sanlock has failed to renew a delta lease for a period of time, it
will begin taking measures to stop local processes (applications) from
using any resource leases associated with the expiring lockspace delta
lease.  sanlock enters this "recovery mode" well ahead of the time when
another host could take over the locally owned leases.  sanlock must have
sufficient time to stop all local processes that are using the expiring
leases.

.IP \[bu]
sanlock uses three methods to stop local processes that are using
expiring leases:

1. Graceful shutdown.  sanlock will execute a "graceful shutdown" program
that the application previously specified for this case.  The shutdown
program tells the application to shut down because its leases are
expiring.  The application must respond by stopping its activities and
releasing its leases (or exit).  If an application does not specify a
graceful shutdown program, sanlock sends SIGTERM to the process instead.
The process must release its leases or exit in a prescribed amount of time
(see -g), or sanlock proceeds to the next method of stopping.  

2. Forced shutdown.  sanlock will send SIGKILL to processes using the
expiring leases.  The processes have a fixed amount of time to exit after
receiving SIGKILL.  If any do not exit in this time, sanlock will proceed
to the next method.

3. Host reset.  sanlock will trigger the host's watchdog device to
forcibly reset it.  sanlock carefully manages the timing of the watchdog
device so that it fires shortly before any other host could take over the
resource leases held by local processes.

.P

.I Failures

If a process holding resource leases fails or exits without releasing its
leases, sanlock will release the leases for it automatically (unless
persistent resource leases were used.)

If the sanlock daemon cannot renew a lockspace delta lease for a specific
period of time (see Expiration), sanlock will enter "recovery mode" where
it attempts to stop and/or kill any processes holding resource leases in
the expiring lockspace.  If the processes do not exit in time, sanlock
will force the host to be reset using the local watchdog device.

If the sanlock daemon crashes or hangs, it will not renew the expiry time
of the per-lockspace connections it had to the wdmd daemon.  This will
lead to the expiration of the local watchdog device, and the host will be
reset.

.I Watchdog

sanlock uses the wdmd(8) daemon to access /dev/watchdog.  wdmd multiplexes
multiple timeouts onto the single watchdog timer.  This is required
because delta leases for each lockspace are renewed and expire
independently.

sanlock maintains a wdmd connection for each lockspace delta lease being
renewed.  Each connection has an expiry time for some seconds in the
future.  After each successful delta lease renewal, the expiry time is
renewed for the associated wdmd connection.  If wdmd finds any connection
expired, it will not renew the /dev/watchdog timer.  Given enough
successive failed renewals, the watchdog device will fire and reset the
host.  (Given the multiplexing nature of wdmd, shorter overlapping renewal
failures from multiple lockspaces could cause spurious watchdog firing.)

The direct link between delta lease renewals and watchdog renewals
provides a predictable watchdog firing time based on delta lease renewal
timestamps that are visible from other hosts.  sanlock knows the time the
watchdog on another host has fired based on the delta lease time.
Furthermore, if the watchdog device on another host fails to fire when it
should, the continuation of delta lease renewals from the other host will
make this evident and prevent leases from being taken from the failed
host.

If sanlock is able to stop/kill all processing using an expiring
lockspace, the associated wdmd connection for that lockspace is removed.
The expired wdmd connection will no longer block /dev/watchdog renewals,
and the host should avoid being reset.

.I Storage

The sector size and the align size should be specified when creating
lockspaces and resources (and rindex).  The "align size" is the size on
disk of a lockspace or a resource, i.e. the amount of disk space it uses.
Lockspaces and resources should use matching sector and align sizes, and
must use offsets in multiples of the align size.  The max number of hosts
that can use a lockspace or resource depends on the combination of sector
size and align size, shown below.  The host_id of hosts using the
lockspace can be no larger than the max_hosts value for the lockspace.

Accepted combinations of sector size and align size, and the corresponding
max_hosts (and max host_id) are:

sector_size 512, align_size 1M, max_hosts 2000
.br
sector_size 4096, align_size 1M, max_hosts 250
.br
sector_size 4096, align_size 2M, max_hosts 500
.br
sector_size 4096, align_size 4M, max_hosts 1000
.br
sector_size 4096, align_size 8M, max_hosts 2000

When sector_size and align_size are not specified, the behavior matches
the behavior before these sizes could be configured: on devices which
report sector size 512, 512/1M/2000 is used, on devices which report
sector size 4096, 4096/8M/2000 is used, and on files, 512/1M/2000 is
always used.  (Other combinations are not compatible with sanlock version
3.6 or earlier.)

Using sanlock on shared block devices that do host based mirroring or
replication is not likely to work correctly.  When using sanlock on shared
files, all sanlock io should go to one file server.

.I Example

This is an example of creating and using lockspaces and resources from the
command line.  (Most applications would use sanlock through libsanlock
rather than through the command line.)

.IP 1. 4
Allocate shared storage for sanlock leases.

This example assumes 512 byte sectors on the device, in which case the
lockspace needs 1MB and each resource needs 1MB.

The example shared block device accessible to all hosts is /dev/leases.

.IP 2. 4
Start sanlock on all hosts.

The -w 0 disables use of the watchdog for testing.

.nf
# sanlock daemon -w 0
.fi

.IP 3. 4
Start a dummy application on all hosts.

This sanlock command registers with sanlock, then execs the sleep command
which inherits the registered fd.  The sleep process acts as the dummy
application.  Because the sleep process is registered with sanlock, leases
can be acquired for it.

.nf
# sanlock client command -c /bin/sleep 600 &
.fi

.IP 4. 4
Create a lockspace for the application (from one host).

The lockspace is named "test".

.nf
# sanlock client init -s test:0:/dev/leases:0
.fi

.IP 5. 4
Join the lockspace for the application.

Use a unique host_id on each host.

.nf
host1:
# sanlock client add_lockspace -s test:1:/dev/leases:0
host2:
# sanlock client add_lockspace -s test:2:/dev/leases:0
.fi

.IP 6. 4
Create two resources for the application (from one host).

The resources are named "RA" and "RB".  Offsets are used on the same
device as the lockspace.  Different LVs or files could also be used.

.nf
# sanlock client init -r test:RA:/dev/leases:1048576
# sanlock client init -r test:RB:/dev/leases:2097152
.fi

.IP 7. 4
Acquire resource leases for the application on host1.

Acquire an exclusive lease (the default) on the first resource, and a
shared lease (SH) on the second resource.

.nf
# export P=`pidof sleep`
# sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
# sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P
.fi

.IP 8. 4
Acquire resource leases for the application on host2.

Acquiring the exclusive lease on the first resource will fail because it
is held by host1.  Acquiring the shared lease on the second resource will
succeed.

.nf
# export P=`pidof sleep`
# sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P
# sanlock client acquire -r test:RB:/dev/leases:2097152:SH -p $P
.fi

.IP 9. 4
Release resource leases for the application on both hosts.

The sleep pid could also be killed, which will result in the sanlock
daemon releasing its leases when it exits.

.nf
# sanlock client release -r test:RA:/dev/leases:1048576 -p $P
# sanlock client release -r test:RB:/dev/leases:2097152 -p $P
.fi

.IP 10. 4
Leave the lockspace for the application.

.nf
host1:
# sanlock client rem_lockspace -s test:1:/dev/leases:0
host2:
# sanlock client rem_lockspace -s test:2:/dev/leases:0
.fi

.IP 11. 4
Stop sanlock on all hosts.

.nf
# sanlock shutdown
.fi


.SH OPTIONS

.P
COMMAND can be one of three primary top level choices
.P
.BR "sanlock daemon" " start daemon"
.br
.BR "sanlock client" " send request to daemon (default command if none given)"
.br
.BR "sanlock direct" " access storage directly (no coordination with daemon)"

.SS Daemon Command

.BR "sanlock daemon" " [options]"

.BR -D "    "
no fork and print all logging to stderr

.BR -Q " 0|1"
quiet error messages for common lock contention

.BR -R " 0|1"
renewal debugging, log debug info for each renewal

.BI -L " pri"
write logging at priority level and up to logfile (-1 none)

.BI -S " pri"
write logging at priority level and up to syslog (-1 none)

.BI -U " uid"
user id

.BI -G " gid"
group id

.BI -H " num"
renewal history size

.BI -t " num"
max worker threads

.BI -g " sec"
seconds for graceful recovery

.BR -w " 0|1"
use watchdog through wdmd

.BR -h " 0|1"
use high priority (RR) scheduling

.BI -l " num"
use mlockall (0 none, 1 current, 2 current and future)

.BI -b " sec"
seconds a host id bit will remain set in delta lease bitmap

.BI -e " str"
local host name used in delta leases

./" non-aio is untested and may not work
./" .BR \-a " 0|1"
./" use async i/o

.SS Client Command

.B "sanlock client"
.I action
[options]

.B sanlock client status

Print processes, lockspaces, and resources being managed by the sanlock
daemon.  Add -D to show extra internal daemon status for debugging.
Add -o p to show resources by pid, or -o s to show resources by lockspace.

.B sanlock client host_status

Print state of host_id delta leases read during the last renewal.
State of all lockspaces is shown (use -s to select one).
Add -D to show extra internal daemon status for debugging.

.B sanlock client gets

Print lockspaces being managed by the sanlock daemon.  The LOCKSPACE
string will be followed by ADD or REM if the lockspace is currently being
added or removed.  Add -h 1 to also show hosts in each lockspace.

.BR "sanlock client renewal -s" " LOCKSPACE"

Print a history of renewals with timing details.
See the Renewal history section below.

.B sanlock client log_dump

Print the sanlock daemon internal debug log.

.B sanlock client shutdown

Ask the sanlock daemon to exit.  Without the force option (-f 0), the
command will be ignored if any lockspaces exist.  With the force option
(-f 1), any registered processes will be killed, their resource leases
released, and lockspaces removed.  With the wait option (-w 1), the
command will wait for a result from the daemon indicating that it has
shut down and is exiting, or cannot shut down because lockspaces
exist (command fails).

.BR "sanlock client init -s" " LOCKSPACE"

Tell the sanlock daemon to initialize a lockspace on disk.  The -o option
can be used to specify the io timeout to be written in the host_id leases.
The -Z and -A options can be used to specify the sector size and align
size, and both should be set together. 
(Also see sanlock direct init.)

.BR "sanlock client init -r" " RESOURCE"

Tell the sanlock daemon to initialize a resource lease on disk.
The -Z and -A options can be used to specify the sector size and align
size, and both should be set together.
(Also see sanlock direct init.)

.BR "sanlock client read -s" " LOCKSPACE"

Tell the sanlock daemon to read a lockspace from disk.  Only the
LOCKSPACE path and offset are required.  If host_id is zero, the first
record at offset (host_id 1) is used.  The complete LOCKSPACE is printed.
Add -D to print other details.
(Also see sanlock direct read_leader.)

.BR "sanlock client read -r" " RESOURCE"

Tell the sanlock daemon to read a resource lease from disk.  Only the
RESOURCE path and offset are required.  The complete RESOURCE is printed.
Add -D to print other details.
(Also see sanlock direct read_leader.)

.BR "sanlock client add_lockspace -s" " LOCKSPACE"

Tell the sanlock daemon to acquire the specified host_id in the lockspace.
This will allow resources to be acquired in the lockspace.  The -o option
can be used to specify the io timeout of the acquiring host, and will be
written in the host_id lease.

.BR "sanlock client inq_lockspace -s" " LOCKSPACE"

Inquire about the state of the lockspace in the sanlock daemon, whether
it is being added or removed, or is joined.

.BR "sanlock client rem_lockspace -s" " LOCKSPACE"

Tell the sanlock daemon to release the specified host_id in the lockspace.
Any processes holding resource leases in this lockspace will be killed,
and the resource leases not released.

.BR "sanlock client command -r" " RESOURCE " \
\fB-c\fP " " \fIpath\fP " " \fIargs\fP

Register with the sanlock daemon, acquire the specified resource lease,
and exec the command at path with args.  When the command exits, the
sanlock daemon will release the lease.  -c must be the final option.

.BR "sanlock client acquire -r" " RESOURCE " \
\fB-p\fP " " \fIpid\fP
.br
.BR "sanlock client release -r" " RESOURCE " \
\fB-p\fP " " \fIpid\fP

Tell the sanlock daemon to acquire or release the specified resource lease
for the given pid.  The pid must be registered with the sanlock daemon.
acquire can optionally take a versioned RESOURCE string RESOURCE:lver,
where lver is the version of the lease that must be acquired, or fail.

.BR "sanlock client convert -r" " RESOURCE " \
\fB-p\fP " " \fIpid\fP

Tell the sanlock daemon to convert the mode of the specified resource
lease for the given pid.  If the existing mode is exclusive (default),
the mode of the lease can be converted to shared with RESOURCE:SH.  If the
existing mode is shared, the mode of the lease can be converted to
exclusive with RESOURCE (no :SH suffix).

.BI "sanlock client inquire -p" " pid"

Print the resource leases held the given pid.  The format is a versioned
RESOURCE string "RESOURCE:lver" where lver is the version of the lease
held.

.BR "sanlock client request -r" " RESOURCE " \
\fB-f\fP " " \fIforce_mode\fP

Request the owner of a resource do something specified by force_mode.  A
versioned RESOURCE:lver string must be used with a greater version than is
presently held.  Zero lver and force_mode clears the request.

.BR "sanlock client examine -r" " RESOURCE"

Examine the request record for the currently held resource lease and carry
out the action specified by the requested force_mode.

.BR "sanlock client examine -s" " LOCKSPACE"

Examine requests for all resource leases currently held in the named
lockspace.  Only lockspace_name is used from the LOCKSPACE argument.

.BR "sanlock client set_event -s" " LOCKSPACE " \
\fB-i\fP " " \fIhost_id\fP " " \
\fB-g\fP " " \fIgen\fP " " \
\fB-e\fP " " \fInum\fP " " \
\fB-d\fP " " \fInum\fP

Set an event for another host.  When the sanlock daemon next renews
its delta lease for the lockspace it will: set the bit for the host_id
in its bitmap, and set the generation, event and data values in its own
delta lease.  An application that has registered for events from this
lockspace on the destination host will get the event that has been set
when the destination sees the event during its next delta lease renewal.

.BR "sanlock client set_config -s" " LOCKSPACE

Set a configuration value for a lockspace.
Only lockspace_name is used from the LOCKSPACE argument.
The USED flag has the same effect on a lockspace as a process
holding a resource lease that will not exit.  The USED_BY_ORPHANS
flag means that an orphan resource lease will have the same effect
as the USED.
.br
\-u 0|1 Set (1) or clear (0) the USED flag.
.br
\-O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.

\fBsanlock client format -x\fP RINDEX

Create a resource index on disk.  Use -Z and -A to set the sector size
and align size to match the lockspace.

\fBsanlock client create -x\fP RINDEX \fB-e\fP \fIresource_name\fP

Create a new resource lease on disk, using the rindex to
find a free offset.

\fBsanlock client delete -x\fP RINDEX \fB-e\fP \fIresource_name\fP[:\fIoffset\fP]

Delete an existing resource lease on disk.

\fBsanlock client lookup -x\fP RINDEX \fB-e\fP \fIresource_name\fP

Look up the offset of an existing resource lease by name on disk, using
the rindex.  With no -e option, lookup returns the next free lease offset.
If -e specifes both name and offset, the lookup verifies both are correct.

\fBsanlock client update -x\fP RINDEX \fB-e\fP \fIresource_name\fP[:\fIoffset\fP] [\fB-z 0|1\fP]

Add (-z 0) or remove (-z 1) an rindex entry on disk.

\fBsanlock client rebuild -x\fP RINDEX

Rebuild the rindex entries by scanning the disk for resource leases.


.SS Direct Command

.B "sanlock direct"
.I action
[options]

./" non-aio is untested and may not work
./" .BR \-a " 0|1"
./" use async i/o

.BI -o " sec"
io timeout in seconds

.BR "sanlock direct init -s" " LOCKSPACE"
.br
.BR "sanlock direct init -r" " RESOURCE"

Initialize storage for a lockspace or resource.  Use the -Z and -A flags
to specify the sector size and align size.  The max hosts that can use the
lockspace/resource (and the max possible host_id) is determined by the
sector/align size combination.  Possible combinations are: 512/1M,
4096/1M, 4096/2M, 4096/4M, 4096/8M.  Lockspaces and resources both use the
same amount of space (align_size) for each combination.  When initializing
a lockspace, sanlock initializes delta leases for max_hosts in the given
space.  When initializing a resource, sanlock initializes a single paxos
lease in the space.  With -s, the -o option specifies the io timeout to be
written in the host_id leases.  With -r, the -z 1 option invalidates the
resource lease on disk so it cannot be used until reinitialized normally.

.BR "sanlock direct read_leader -s" " LOCKSPACE"
.br
.BR "sanlock direct read_leader -r" " RESOURCE"

Read a leader record from disk and print the fields.  The leader record is
the single sector of a delta lease, or the first sector of a paxos lease.
./" .P
./" .BR "sanlock direct acquire_id -s" " LOCKSPACE"
./" .br
./" .BR "sanlock direct renew_id -s" " LOCKSPACE"
./" .br
./" .BR "sanlock direct release_id -s" " LOCKSPACE"
./"
./" Acquire, renew, or release a host_id directly to disk, independent from
./" the sanlock daemon.  Not for general use.  This should only be used for
./" testing or for manual recovery in an emergency.
./"
./" .P
./" .BR "sanlock direct acquire -r" " RESOURCE " \
./" \fB-i\fP " " \fInum\fP " " \fB-g\fP " " \fInum\fP
./" .br
./" .BR "sanlock direct release -r" " RESOURCE " \
./" \fB-i\fP " " \fInum\fP " " \fB-g\fP " " \fInum\fP
./"
./" Not supported.  Not for general use.
./"

.BI "sanlock direct dump" " path" \
\fR[\fP\fB:\fP\fIoffset\fP\fR[\fP\fB:\fP\fIsize\fP\fR]]\fP

Read disk sectors and print leader records for delta or paxos leases.  Add
-f 1 to print the request record values for paxos leases, host_ids set
in delta lease bitmaps, and rindex entries.

\fBsanlock direct format -x\fP RINDEX
.br
\fBsanlock direct lookup -x\fP RINDEX \fB-e\fP \fIresource_name\fP
.br
\fBsanlock direct update -x\fP RINDEX \fB-e\fP \fIresource_name\fP[:\fIoffset\fP] [\fB-z 0|1\fP]
.br
\fBsanlock direct rebuild -x\fP RINDEX

Access the resource index on disk without going through the sanlock
daemon.  This precludes using the internal paxos lease to protect rindex
modifications.  See client equivalents for descriptions.


.SS
LOCKSPACE option string

.BR \-s " " \fIlockspace_name\fP:\fIhost_id\fP:\fIpath\fP:\fIoffset\fP
.P
.IR lockspace_name " name of lockspace"
.br
.IR host_id " local host identifier in lockspace"
.br
.IR path " path to storage to use for leases"
.br
.IR offset " offset on path (bytes)"
.br

.SS
RESOURCE option string

.BR \-r " " \fIlockspace_name\fP:\fIresource_name\fP:\fIpath\fP:\fIoffset\fP
.P
.IR lockspace_name " name of lockspace"
.br
.IR resource_name " name of resource"
.br
.IR path " path to storage to use leases"
.br
.IR offset " offset on path (bytes)"

.SS
RESOURCE option string with suffix

.BR \-r " " \fIlockspace_name\fP:\fIresource_name\fP:\fIpath\fP:\fIoffset\fP:\fIlver\fP
.P
.IR lver " leader version"

.BR \-r " " \fIlockspace_name\fP:\fIresource_name\fP:\fIpath\fP:\fIoffset\fP:SH
.P
SH indicates shared mode

.SS
RINDEX option string

\fB\-x\fP \fIlockspace_name\fP:\fIpath\fP:\fIoffset\fP
.P
.IR lockspace_name " name of lockspace"
.br
.IR path " path to storage to use for leases"
.br
.IR offset " offset on path (bytes) of rindex"


.SS Defaults

.B sanlock help
shows the default values for the options above.

.B sanlock version
shows the build version.

.SH OTHER

.SS Request/Examine

The first part of making a request for a resource is writing the request
record of the resource (the sector following the leader record).  To make
a successful request:
.IP \(bu 2
RESOURCE:lver must be greater than the lver presently held by the other
host.  This implies the leader record must be read to discover the lver,
prior to making a request.
.IP \(bu 2
RESOURCE:lver must be greater than or equal to the lver presently
written to the request record.  Two hosts may write a new request at the
same time for the same lver, in which case both would succeed, but the
force_mode from the last would win.
.IP \(bu 2
The force_mode must be greater than zero.
.IP \(bu 2
To unconditionally clear the request record (set both lver and
force_mode to 0), make request with RESOURCE:0 and force_mode 0.

.P

The owner of the requested resource will not know of the request unless it
is explicitly told to examine its resources via the "examine" api/command,
or otherwise notfied.

The second part of making a request is notifying the resource lease owner
that it should examine the request records of its resource leases.  The
notification will cause the lease owner to automatically run the
equivalent of "sanlock client examine -s LOCKSPACE" for the lockspace of
the requested resource.

The notification is made using a bitmap in each host_id delta lease.  Each
bit represents each of the possible host_ids (1-2000).  If host A wants to
notify host B to examine its resources, A sets the bit in its own bitmap
that corresponds to the host_id of B.  When B next renews its delta lease,
it reads the delta leases for all hosts and checks each bitmap to see if
its own host_id has been set.  It finds the bit for its own host_id set in
A's bitmap, and examines its resource request records.  (The bit remains
set in A's bitmap for set_bitmap_seconds.)

.I force_mode
determines the action the resource lease owner should take:

.IP \[bu] 2
FORCE (1): kill the process holding the resource lease.  When the
process has exited, the resource lease will be released, and can then be
acquired by anyone.  The kill signal is SIGKILL (or SIGTERM if SIGKILL is
restricted.)

.IP \[bu] 2
GRACEFUL (2): run the program configured by sanlock_killpath against
the process holding the resource lease.  If no killpath is defined, then
FORCE is used.

.P

.SS Persistent and orphan resource leases

A resource lease can be acquired with the PERSISTENT flag (-P 1).  If the
process holding the lease exits, the lease will not be released, but kept
on an orphan list.  Another local process can acquire an orphan lease
using the ORPHAN flag (-O 1), or release the orphan lease using the ORPHAN
flag (-O 1).  All orphan leases can be released by setting the lockspace
name (-s lockspace_name) with no resource name.

.P

.SS Renewal history

sanlock saves a limited history of lease renewal information in each lockspace.
See sanlock.conf renewal_history_size to set the amount of history or to
disable (set to 0).

IO times are measured in delta lease renewal (each delta lease renewal
includes one read and one write).

For each successful renewal, a record is saved that includes:
.IP \[bu] 2
the timestamp written in the delta lease by the renewal
.IP \[bu] 2
the time in milliseconds taken by the delta lease read
.IP \[bu] 2
the time in milliseconds taken by the delta lease write

.P
 
Also counted and recorded are the number io timeouts and
other io errors that occur between successful renewals.

Two consecutive successful renewals would be recorded as:
.br
.nf
timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0
.fi

Those fields are:
    
.IP \[bu] 2
timestamp is the value written into the delta lease during
that renewal.

.IP \[bu] 2
read_ms/write_ms are the milliseconds taken for the renewal
read/write ios.
    
.IP \[bu] 2
next_timeouts are the number of io timeouts that occured
after the renewal recorded on that line, and before the next
successful renewal on the following line.

.IP \[bu] 2
next_errors are the number of io errors (not timeouts) that
occured after renewal recorded on that line, and before the
next successful renewal on the following line.
    
.P

The command 'sanlock client renewal -s lockspace_name' reports
the full history of renewals saved by sanlock, which by default
is 180 records, about 1 hour of history when using a 20 second
renewal interval for a 10 second io timeout.

.SH INTERNALS

.SS Disk Format

.IP \[bu] 2
This example uses 512 byte sectors.
.IP \[bu] 2
Each lockspace is 1MB.  It holds 2000 delta_leases, one per sector,
supporting up to 2000 hosts.
.IP \[bu] 2
Each paxos_lease is 1MB.  It is used as a lease for one resource.
.IP \[bu] 2
The leader_record structure is used differently by each lease type.
.IP \[bu] 2
To display all leader_record fields, see sanlock direct read_leader.
.IP \[bu] 2
A lockspace is often followed on disk by the paxos_leases used within that
lockspace, but this layout is not required.
.IP \[bu] 2
The request_record and host_id bitmap are used for requests/events.
.IP \[bu] 2
The mode_block contains the SHARED flag indicating a lease is held in the
shared mode.
.IP \[bu] 2
In a lockspace, the host using host_id N writes to a single delta_lease in
sector N-1.  No other hosts write to this sector.  All hosts read all
lockspace sectors when renewing their own delta_lease, and are able to
monitor renewals of all delta_leases.
.IP \[bu] 2
In a paxos_lease, each host has a dedicated sector it writes to,
containing its own paxos_dblock and mode_block structures.  Its sector is
based on its host_id; host_id 1 writes to the dblock/mode_block in sector
2 of the paxos_lease.
.IP \[bu] 2
The paxos_dblock structures are used by the paxos_lease algorithm, and the
result is written to the leader_record.

.P

.B 0x000000 lockspace foo:0:/path:0

(There is no representation on disk of the lockspace in general, only the
sequence of specific delta_leases which collectively represent the
lockspace.)

.B delta_lease foo:1:/path:0
.nf
0x000 0         leader_record         (sector 0, for host_id 1)
                magic: 0x12212010 
                space_name: foo
                resource_name: host uuid/name
                \.\.\.
                host_id bitmap        (leader_record + 256)
.fi

.B delta_lease foo:2:/path:0
.nf
0x200 512       leader_record         (sector 1, for host_id 2)
                magic: 0x12212010
                space_name: foo
                resource_name: host uuid/name
                \.\.\.
                host_id bitmap        (leader_record + 256)
.fi

.B delta_lease foo:3:/path:0
.nf
0x400 1024      leader_record         (sector 2, for host_id 3)
                magic: 0x12212010
                space_name: foo
                resource_name: host uuid/name
                \.\.\.
                host_id bitmap        (leader_record + 256)
.fi

.B delta_lease foo:2000:/path:0
.nf
0xF9E00         leader_record         (sector 1999, for host_id 2000)
                magic: 0x12212010
                space_name: foo
                resource_name: host uuid/name
                \.\.\.
                host_id bitmap        (leader_record + 256)
.fi

.B 0x100000 paxos_lease foo:example1:/path:1048576
.nf
0x000 0         leader_record         (sector 0)
                magic: 0x06152010
                space_name: foo
                resource_name: example1

0x200 512       request_record        (sector 1)
                magic: 0x08292011

0x400 1024      paxos_dblock          (sector 2, for host_id 1)
0x480 1152      mode_block            (paxos_dblock + 128)

0x600 1536      paxos_dblock          (sector 3, for host_id 2)
0x680 1664      mode_block            (paxos_dblock + 128)

0x800 2048      paxos_dblock          (sector 4, for host_id 3)
0x880 2176      mode_block            (paxos_dblock + 128)

0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
0xFA280         mode_block            (paxos_dblock + 128)
.fi

.B 0x200000 paxos_lease foo:example2:/path:2097152
.nf
0x000 0         leader_record         (sector 0)
                magic: 0x06152010
                space_name: foo
                resource_name: example2

0x200 512       request_record        (sector 1)
                magic: 0x08292011

0x400 1024      paxos_dblock          (sector 2, for host_id 1)
0x480 1152      mode_block            (paxos_dblock + 128)

0x600 1536      paxos_dblock          (sector 3, for host_id 2)
0x680 1664      mode_block            (paxos_dblock + 128)

0x800 2048      paxos_dblock          (sector 4, for host_id 3)
0x880 2176      mode_block            (paxos_dblock + 128)

0xFA200         paxos_dblock          (sector 2001, for host_id 2000)
0xFA280         mode_block            (paxos_dblock + 128)
.fi

.SS Lease ownership

Not shown in the leader_record structures above are the owner_id,
owner_generation and timestamp fields.  These are the fields that define
the lease owner.

The delta_lease at sector N for host_id N+1 has leader_record.owner_id
N+1.  The leader_record.owner_generation is incremented each time the
delta_lease is acquired.  When a delta_lease is acquired, the
leader_record.timestamp field is set to the time of the host and the
leader_record.resource_name is set to the unique name of the host.  When
the host renews the delta_lease, it writes a new leader_record.timestamp.
When a host releases a delta_lease, it writes zero to
leader_record.timestamp.

When a host acquires a paxos_lease, it uses the host_id/generation value
from the delta_lease it holds in the lockspace.  It uses this
host_id/generation to identify itself in the paxos_dblock when running the
paxos algorithm.  The result of the algorithm is the winning
host_id/generation - the new owner of the paxos_lease.  The winning
host_id/generation are written to the paxos_lease leader_record.owner_id
and leader_record.owner_generation fields and leader_record.timestamp is
set.  When a host releases a paxos_lease, it sets leader_record.timestamp
to 0.

When a paxos_lease is free (leader_record.timestamp is 0), multiple hosts
may attempt to acquire it.  The paxos algorithm, using the paxos_dblock
structures, will select only one of the hosts as the new owner, and that
owner is written in the leader_record.  The paxos_lease will no longer be
free (non-zero timestamp).  Other hosts will see this and will not attempt
to acquire the paxos_lease until it is free again.

If a paxos_lease is owned (non-zero timestamp), but the owner has not
renewed its delta_lease for a specific length of time, then the owner
value in the paxos_lease becomes expired, and other hosts will use the
paxos algorithm to acquire the paxos_lease, and set a new owner.

.SH FILES

/etc/sanlock/sanlock.conf

.IP \[bu] 2
quiet_fail = 1
.br
See -Q

.IP \[bu] 2
debug_renew = 0
.br
See -R

.IP \[bu] 2
logfile_priority = 4
.br
See -L

.IP \[bu] 2
logfile_use_utc = 0
.br
Use UTC instead of local time in log messages.

.IP \[bu] 2
syslog_priority = 3
.br
See -S

.IP \[bu] 2
names_log_priority = 4
.br
Log resource names at this priority level (uses syslog priority numbers).
If this is greater than or equal to logfile_priority, each requested resource
name and location is recorded in sanlock.log.

.IP \[bu] 2
use_watchdog = 1
.br
See -w

.IP \[bu] 2
high_priority = 1
.br
See -h

.IP \[bu] 2
mlock_level = 1
.br
See -l

.IP \[bu] 2
sh_retries = 8
.br
The number of times to try acquiring a paxos lease when acquiring a shared
lease when the paxos lease is held by another host acquiring a shared lease.

.IP \[bu] 2
uname = sanlock
.br
See -U

.IP \[bu] 2
gname = sanlock
.br
See -G

.IP \[bu] 2
our_host_name = <str>
.br
See -e

.IP \[bu] 2
renewal_read_extend_sec = <seconds>
.br
If a renewal read i/o times out, wait this many additional seconds for that
read to complete at the start of the subsequent renewal attempt.  When not
configured, sanlock waits for an additional io_timeout seconds for a previous
timed out read to complete.

.IP \[bu] 2
renewal_history_size = 180
.br
See -H

.IP \[bu] 2
paxos_debug_all = 0
.br
Include all details in the paxos debug logging.

.IP \[bu] 2
debug_io = <str>
.br
Add debug logging for each i/o.  "submit" (no quotes) produces debug
output at submission time, "complete" produces debug output at completion
time, and "submit,complete" (no space) produces both.

.IP \[bu] 2
max_sectors_kb = <str>|<num>
.br
Set to "ignore" (no quotes) to prevent sanlock from checking or
changing max_sectors_kb for the lockspace disk when starting a lockspace.
Set to "align" (no quotes) to set max_sectors_kb for the lockspace disk
to the align size of the lockspace.
Set to a number to set a specific number of KB for all lockspace disks.

.IP \[bu] 2
debug_clients = 0
.br
Enable or disable debug logging for all client connections to the sanlock daemon.

.IP \[bu] 2
debug_cmd = +|-<name>
.br
Enable (+name) or disable (-name) debug logging at the command processing
level for specifically named commands, e.g. "debug_cmd = +acquire", or
"debug_cmd = -inq_lockspace".  Repeat this line for each command name.
Use a plus prefix before the name to enable and a minus prefix to disable.
By default sanlock disables some command level debugging for commands that
are often repetitive and fill the in memory debug buffer.
This only affects debug logging, not errors or warnings, and disabling
command level debugging for a command does not disable lower level debugging
for that command.  Special values +all and -all can be used to
enable or disable all commands, and can be used before or after other
debug_cmd lines.

.SH SEE ALSO
.BR wdmd (8)

